In 2009, the American Association for the Advancement of Science and the National Science Foundation released a call-to-action report, Vision and Change, that recommended major changes in undergraduate biology education to reflect how advances in biological science occur in the 21st century. The authors of the report note: “To contribute effectively to this ‘New Biology’, scientists need to interact with information in new ways, including being able to manage large, complex data sets. Systems approaches and biological modeling rely on the application of mathematics and statistical analysis, while the explosive generation of larger and larger data sets demands increasingly sophisticated computational knowledge.” (Bray et al. 2016).
A fundamental element of workflows in ecology and evolution is the analysis of data. Most ecologists now commonly write code as part of their laboratory, field, or modeling research. The transition to a greater reliance on code has been driven by increases in the quantity and types of data used in ecological studies, alongside improvements in computing power and software. Code is written in programming languages such as R and Python, and is used by ecologists, evolutionary biologists, and bioinformaticians for a wide variety of tasks including manipulating, analyzing, and graphing data. A benefit of this transition to code-based analyses is that code provides a precise record of what has been done, making it easy to reproduce, adapt, and expand existing analyses.
The name “R” refers to the computational environment initially created by Robert Gentleman and Ross Ihaka, similar in nature to the “S” statistical environment developed at AT&T Bell Laboratories. It has since been developed and maintained by a strong team of core developers (R-core), who are renowned researchers in computational disciplines. R has gained wide acceptance as a reliable and powerful modern computational environment for statistical computing and visualisation, and is now used in many areas of scientific computation. Why bother learning it?
R is free and open source – R is free software, released under the GNU General Public License; this means anyone can read its source code to see how R works, and there are no restrictive, costly licensing arrangements. Because of this transparency, there is less chance for mistakes, and if you (or someone else) find some, you can report and fix bugs.
R is interdisciplinary and extensible – With 10,000+ packages that can be installed to extend its capabilities, R provides a framework that allows you to combine statistical approaches from many scientific disciplines to best suit the analytical framework you need to analyze your data. For instance, R has packages for image analysis, GIS, time series, population genetics, and a lot more. Plus R is extensible, which means that procedures for analyzing or visualizing data that do not currently exist can (and probably will) be readily developed.
R works on data of all shapes and sizes – The skills you learn with R scale easily with the size of your dataset. Whether your dataset has hundreds or millions of lines, it won’t make much difference to you. R is designed for data analysis. It comes with special data structures and data types that make handling of missing data and statistical factors convenient. R can connect to spreadsheets, databases, and many other data formats, on your computer or on the web.
R has a large and welcoming community – Thousands of people use and extend R daily. Many of them are willing to help you through mailing lists and websites such as Stack Overflow, or on the RStudio community.
R produces high-quality graphics and interactive web-based content – The plotting and web development functionalities in R are endless, and allow you to adjust any aspect of your graphics and visualizations to convey most effectively the message from your data.
R does not involve lots of pointing and clicking, and that’s a good thing – The learning curve might be steeper than with other software, but with R, the results of your analysis do not rely on remembering a succession of pointing and clicking, but instead on a series of written commands, and that’s a good thing! So, if you want to redo your analysis because you collected more data, you don’t have to remember which button you clicked in which order to obtain your results; you just have to run your script again.
R is great for reproducibility – Reproducibility is when someone else (including your future self) can obtain the same results from the same dataset when using the same analysis. R integrates with other tools to generate manuscripts from your code. If you collect more data, or fix a mistake in your dataset, the figures and the statistical tests in your manuscript are updated automatically. An increasing number of journals and funding agencies expect analyses to be reproducible, so knowing R will give you an edge with these requirements.
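To make the earlier point about missing data and factors concrete, here is a minimal sketch in base R. The object names and values (`body_mass`, `habitat`) are made up for illustration and do not come from any dataset used in this book.

```r
# Hypothetical measurements; NA marks a missing observation
body_mass <- c(38.5, NA, 41.2, 36.9)

# By default, a missing value propagates through the calculation
mean(body_mass)
#> [1] NA

# na.rm = TRUE drops missing values before computing the mean
mean(body_mass, na.rm = TRUE)
#> [1] 38.86667

# A factor stores categorical data as a fixed set of levels
habitat <- factor(c("forest", "meadow", "forest", "wetland"))
levels(habitat)
#> [1] "forest"  "meadow"  "wetland"

# Count how many observations fall in each level
table(habitat)
```

Note that R does not silently ignore the missing value: you have to state explicitly, with `na.rm = TRUE`, that it should be dropped.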
R is a free software environment for data manipulation, statistics and graphical display. It allows you to load data from pretty much any kind of file, manipulate it, analyze it, and visualize it in pretty much any kind of way, and finally export the output as pretty much any kind of file. It can do pretty much anything then.
R is very much a vehicle for newly developing methods of interactive data analysis. It has developed rapidly, and has been extended by a large collection of packages. However, most programs written in R are essentially ephemeral, written for a single piece of data analysis.
For much of this book, we will assume that you are using R via RStudio. First time users often confuse the two. At its simplest:
| R: Engine | RStudio: Dashboard |
|---|---|
More precisely, R is a programming language that runs computations, while RStudio is an integrated development environment (IDE) that provides an interface with many convenient features and tools. Just as having access to a speedometer, rearview mirrors, and a navigation system makes driving much easier, using RStudio’s interface makes using R much easier.
RStudio is currently a very popular way to not only write your R scripts but also to interact with the R software. To function correctly, RStudio needs R and therefore both need to be installed on your computer.
If you want to find out more about the difference between R and the RStudio IDE, this DataCamp video might be helpful.
You will first need to download and install both R and RStudio (Desktop version) on your computer.
If you had trouble with these two steps, we suggest you watch this DataCamp video.
Let’s start by learning about RStudio, which is an Integrated Development Environment (IDE) for working with R.
The RStudio IDE open-source product is free under the Affero General Public License (AGPL) v3. The RStudio IDE is also available with a commercial license and priority email support from RStudio, Inc.
We will use RStudio IDE to write code, navigate the files on our computer, inspect the variables we are going to create, and visualize the plots we will generate. RStudio can also be used for other things (e.g., version control, developing packages, writing Shiny apps) that we will not cover during the workshop.
RStudio interface screenshot. Clockwise from top left: Source, Environment/History, Files/Plots/Packages/Help/Viewer, Console.
RStudio is divided into 4 “Panes”: the Source for your scripts and documents (top-left, in the default layout), your Environment/History (top-right), your Files/Plots/Packages/Help/Viewer (bottom-right), and the R Console (bottom-left). The placement of these panes and their content can be customized (see menu, Tools -> Global Options -> Pane Layout).
One of the advantages of using RStudio is that all the information you need to write code is available in a single window. Additionally, with many shortcuts, autocompletion, and highlighting for the major file types you use while developing in R, RStudio will make typing easier and less error-prone.
The basis of programming is that we write down instructions for the computer to follow, and then we tell the computer to follow those instructions. We write, or code, instructions in R because it is a common language that both the computer and we can understand. We call the instructions commands and we tell the computer to follow the instructions by executing (also called running) those commands.
There are two main ways of interacting with R: by using the console or by using script files (plain text files that contain your code). The console pane (in RStudio, the bottom left panel) is the place where commands written in the R language can be typed and executed immediately by the computer. It is also where the results will be shown for commands that have been executed. You can type commands directly into the console and press Enter to execute those commands, but they will be forgotten when you close the session.
If R is ready to accept commands, the R console shows a > prompt. If it receives a command (by typing, copy-pasting or sent from the script editor using Ctrl + Enter), R will try to execute it, and when ready, will show the results and come back with a new > prompt to wait for new commands.
If R is still waiting for you to enter more text because the command isn’t complete yet, the console will show a + prompt. This usually happens because you have not ‘closed’ a parenthesis or quotation, i.e. you don’t have the same number of left-parentheses as right-parentheses, or the same number of opening and closing quotation marks. When this happens, and you thought you had finished typing your command, click inside the console window and press Esc; this will cancel the incomplete command and return you to the > prompt.
Let’s review some basics we’ve so far omitted in the interests of getting you plotting as quickly as possible. You can use R as a calculator:
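As a minimal sketch (the specific expressions are illustrative choices, not taken from the original text), typing arithmetic at the console prints the result straight away:

```r
# Basic arithmetic follows the usual order of operations
1 / 200 * 30
#> [1] 0.15

# Parentheses group operations
(59 + 73 + 2) / 3
#> [1] 44.66667

# Built-in mathematical functions and constants are available too
sin(pi / 2)
#> [1] 1
```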
You can create new objects with <-:
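For instance, in the sketch below (the name `x` and its value are arbitrary), the result of a calculation is stored in an object and then printed by typing the object's name:

```r
# Store the result of 3 * 4 in an object called x
x <- 3 * 4

# Typing the name of the object prints its value
x
#> [1] 12
```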
All R statements where you create objects, assignment statements, have the same form:
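The general pattern is shown as a comment in the sketch below, followed by one concrete instance. The name `plot_count` and its value are hypothetical and chosen only for illustration.

```r
# General pattern: object_name <- value

# One concrete instance of the pattern
plot_count <- 12

# The object now holds the value on the right-hand side
plot_count
#> [1] 12
```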
When reading that code say “object name gets value” in your head.
You will make lots of assignments and <- is a pain to type. Don’t be lazy and use =: it will work, but it will cause confusion later. Instead, use RStudio’s keyboard shortcut: Alt + - (the minus sign). Notice that RStudio automagically surrounds <- with spaces, which is a good code formatting practice. Code is miserable to read on a good day, so giveyoureyesabreak and use spaces.
So far you’ve been using the console to run code. That’s a great place to start. However, because we want our code and workflow to be reproducible, it is better to type the commands we want in the script editor, and save the script. This way, there is a complete record of what we did, and anyone (including our future selves!) can easily replicate the results on their computer.
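As a rough sketch of what such a saved script might look like, consider the file below. The file name `analysis.R`, the object names, and the values are all hypothetical.

```r
# analysis.R (hypothetical file name): a small script saved from the script editor

# Hypothetical tree heights in metres
tree_height <- c(12.1, 15.4, 9.8, 14.3)

# Summarise the measurements
mean_height <- mean(tree_height)
max_height <- max(tree_height)

# Print the results to the console
mean_height
max_height
```

Because every step is written down, rerunning the script after adding new measurements to `tree_height` reproduces the whole analysis without any manual steps.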
The script editor is a great place to put code you care about. Keep experimenting in the console, but once you have written code that works and does what you want, put it in the script editor. RStudio will automatically save the contents of the editor when you quit RStudio, and will automatically load it when you re-open. Nevertheless, it’s a good idea to save your scripts regularly and to back them up.
RStudio allows you to execute commands directly from the script editor by using the Ctrl + Enter shortcut (on Macs, Cmd + Return will work, too). The command on the current line in the script (indicated by the cursor) or all of the commands in the currently selected text will be sent to the console and executed when you press Ctrl + Enter. You can find other keyboard shortcuts in this RStudio cheatsheet about the RStudio IDE.
One slightly confusing part of R is how it reports errors, warnings, and messages. The default theme in RStudio colors errors, warnings, and messages in red, which makes them seem like you did something wrong. However, seeing red text in the console is not always bad.
R will show red text in the console in three different situations:
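- Errors: When the red text is a legitimate error, it will start with “Error” and the code will generally not run. For example, if you see Error in ggplot(...) : could not find function "ggplot", it means that the ggplot() function is not accessible because the package was not loaded with library(ggplot2), and thus you cannot use it.
- Warnings: When the red text is a warning, it will start with “Warning” and your code will generally still work, but with some caveats. For example, you may see Warning: Removed 1 rows containing missing values (geom_point). R will still make the scatterplot with all the remaining values, but it’s warning you that one of the points isn’t there.
- Messages: When the red text starts with neither “Error” nor “Warning”, it is just a friendly message. You’ll see these messages when you load the dplyr package in Subsection @ref(package-loading) below, or when you read data saved in spreadsheet files with read_csv() as you’ll see in Chapter @ref(tidy). These are helpful diagnostic messages and they don’t stop your code from working.

Remember, when you see red text in the console, don’t panic. It doesn’t necessarily mean anything is wrong.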
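The three kinds of red text can be triggered on purpose using only base R, as in the sketch below. The calls are illustrative stand-ins and are not the ggplot2 or dplyr examples mentioned in the text; `try()` is used only so that the script keeps running after the deliberate error.

```r
# A message: purely informational, the code keeps running
message("Loading data...")

# A warning: R still returns a result (NA here) but flags something to check
as.numeric("abc")

# An error: R stops evaluating the call and no result is produced
result <- try(log("abc"), silent = TRUE)

# try() captured the error instead of halting the script
class(result)
#> [1] "try-error"
```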
Another point of confusion with many new R users is the idea of an R package. R packages extend the functionality of R by providing additional functions, data, and documentation. They are written by a world-wide community of R users and can be downloaded for free from the internet. For example, among the many packages we will use in this book are:
- The ggplot2 package for data visualization in Chapter @ref(viz).
- The dplyr package for data wrangling in Chapter @ref(wrangling).
- The moderndive package that accompanies this book.
- The infer package for “tidy” and transparent statistical inference in Chapters @ref(confidence-intervals), @ref(hypothesis-testing), and @ref(inference-for-regression).

A good analogy for R packages is that they are like apps you can download onto a mobile phone:
| R: A new phone | R Packages: Apps you can download |
|---|---|
So R is like a new mobile phone: while it has a certain amount of features when you use it for the first time, it doesn’t have everything. R packages are like the apps you can download onto your phone from Apple’s App Store or Android’s Google Play.
Let’s continue this analogy by considering the Instagram app for editing and sharing pictures. Say you have purchased a new phone and you would like to share a recent photo you have taken on Instagram. You need to install the app and then open it.
Once Instagram is open on your phone, you can then proceed to share your photo with your friends and family. The process is very similar for using an R package. You need to install the package and then “load” it.
Let’s now show you how to perform these two steps for the ggplot2 package for data visualization.
There are two ways to install an R package. For example, to install the ggplot2 package:
- The first way uses RStudio’s interface: in the Packages tab of the Files/Plots/Packages/Help/Viewer pane, click “Install”, type ggplot2 as the package name, and click “Install”.
- The second way is by typing install.packages("ggplot2") in the Console pane of RStudio and hitting enter. Note you must include the quotation marks.

Recall that after you’ve installed a package, you need to “load” it, in other words open it. We do this by using the library() command. For example, to load the ggplot2 package, run the following code in the Console pane. What do we mean by “run the following code”? Either type or copy & paste the following code into the Console pane and then hit the enter key.
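The loading step looks like the sketch below. It assumes ggplot2 has already been installed on your computer; the final line is an optional check added here for illustration.

```r
# Assumes ggplot2 was installed beforehand, e.g. with install.packages("ggplot2")
library(ggplot2)

# Optional check: loaded packages appear on R's search path
"package:ggplot2" %in% search()
```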
If after running the above code, a blinking cursor returns next to the > “prompt” sign, it means you were successful and the ggplot2 package is now loaded and ready to use. If however, you get a red “error message” that reads…
Error in library(ggplot2) : there is no package called ‘ggplot2’
… it means that you didn’t successfully install it. In that case, go back to the previous subsection “Package installation” and install it.
One extremely common mistake new R users make when wanting to use particular packages is they forget to “load” them first by using the library() command we just saw. Remember: you have to load each package you want to use every time you start RStudio. If you don’t first “load” a package, but attempt to use one of its features, you’ll see an error message similar to:
Error: could not find function
R is telling you that you are trying to use a function in a package that has not yet been “loaded.” Almost all new users forget to do this when starting out, and it is a little annoying to get used to. However, you’ll remember with practice.
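The effect of loading can be seen without installing anything by using the tools package, which ships with R but is not loaded by default. This is a stand-in chosen for illustration and is not one of the packages used in this book. The file name is hypothetical, and the sketch assumes a fresh R session in which tools has not yet been loaded.

```r
# Before loading: file_ext() lives in the tools package, so R cannot find it
# (try() is used only so the script continues after the error)
attempt <- try(file_ext("survey.csv"), silent = TRUE)
inherits(attempt, "try-error")
#> [1] TRUE

# Load the package, then the same call works
library(tools)
ext <- file_ext("survey.csv")
ext
#> [1] "csv"
```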
Learning to code is very much like learning a foreign language: it can be daunting and frustrating at first. Such frustrations are common, and it is normal to feel discouraged as you learn. However, just as with learning a foreign language, if you put in the effort and are not afraid to make mistakes, anybody can learn.
Here are a few useful tips to keep in mind as you learn to program:
RStudio help interface.
One of the fastest ways to get help is to use the RStudio help interface. By default, this panel is found in the lower right-hand pane of RStudio. As seen in the screenshot, when you type the word “Mean”, RStudio also offers a number of suggestions that you might be interested in. The description is then shown in the display window.
If you need help with a specific function, let’s say barplot(), you can type:
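A minimal sketch of both the shortcut and its longer equivalent:

```r
# Open the help page for barplot() using the question-mark shortcut
?barplot

# The same request written as a regular function call
help("barplot")
```

In RStudio, the help page opens in the Help tab; in a plain R console it is shown as text.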
If you just need to remind yourself of the names of the arguments, you can use:
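For example, using `lm()` as an arbitrary illustration (the choice of function is an assumption, not something specified in the text):

```r
# Print the argument names and default values of lm()
args(lm)
```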
If you are looking for a function to do a particular task, you can use the help.search() function, which is called by the double question mark ??. However, this only looks through the installed packages for help pages with a match to your search request.
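A short sketch of both forms is below. The search term "kruskal" is a hypothetical example, standing in for whatever task you are looking for.

```r
# Search the help pages of all installed packages for "kruskal"
??kruskal

# The same search written as a regular function call
help.search("kruskal")
```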
If you can’t find what you are looking for, you can use the rdocumentation.org website that searches through the help files across all packages available.
Finally, a generic Google or internet search “R <task>” will often either send you to the appropriate package documentation or a helpful forum where someone else has already asked your question.
Start by googling the error message. However, this doesn’t always work very well because often, package developers rely on the error catching provided by R. You end up with general error messages that might not be very helpful to diagnose a problem (e.g. “subscript out of bounds”). If the message is very generic, you might also include the name of the function or package you’re using in your query.
However, you should check Stack Overflow. Search using the [r] tag. Most questions have already been answered, but the challenge is to use the right words in the search to find the answers: http://stackoverflow.com/questions/tagged/r
The Introduction to R can also be dense for people with little programming experience but it is a good place to understand the underpinnings of the R language.
The R FAQ is dense and technical but it is full of useful information.